Abstract

This document concisely describes the ubiquitous class of exponential family distributions
met in statistics. The first part recalls definitions and summarizes the main properties and the duality
with Bregman divergences (all proofs are skipped). The second part lists decompositions and
related formulas of common exponential family distributions. We recall the Fisher-Rao-Riemannian
geometries and the dual affine connection information geometries of statistical manifolds. We
intend to maintain and update this document and catalog by adding new distribution items.
Part I

A digest of exponential families 



1 Essentials of exponential families 

1.1 Sufficient statistics 

A fundamental problem in statistics is to recover the model parameters $\lambda$ from a given set of
observations $x_1, ..., x_n$. Those samples are assumed to be randomly drawn from an independent
and identically-distributed random vector with associated density $p(x;\lambda)$. Since the sample set is
finite, statisticians estimate a close approximation $\hat{\lambda}$ of the true parameter. However, a surprising
fact is that one can collect and concentrate from a random sample all the information necessary for
recovering/estimating the parameters. The information is collected into a few elementary statistics
of the random vector, called the sufficient statistics^1. Figure 1 illustrates the notions of statistics
and sufficiency.

It is challenging to find sufficient statistics for a given parametric probability distribution
function $p(x;\lambda)$. The Fisher-Neyman factorization theorem allows one to easily identify those sufficient
statistics from the decomposition characteristics of the probability distribution function. A statistic
$t(x)$ is sufficient if and only if the density can be decomposed as

    $p(x;\lambda) = a(x)\, b_\lambda(t(x))$,    (1)

where $a(x) \geq 0$ is a non-negative function independent of the distribution parameters. The class
of exponential families encompasses most common statistical distributions, and these families are
provably the only ones (under mild conditions) that allow for data reduction.







Figure 1: Collecting statistics of a parametric random vector allows one to perform data reduction 
for inference problem if and only if those statistics are sufficient. Otherwise, loss of information 
occurs and the parameters of the family of distributions cannot be fully recovered from the (insuf- 
ficient) statistics. 



1 The term was first coined by the statistician Sir Ronald Fisher in 1922.
1.2 Exponential families: Definition and properties

An exponential family is a set of probability distributions admitting the following canonical
decomposition:

    $p(x;\theta) = \exp\left(\langle t(x), \theta \rangle - F(\theta) + k(x)\right)$    (2)

where

• $t(x)$ is the sufficient statistic,

• $\theta$ are the natural parameters,

• $\langle \cdot, \cdot \rangle$ is the inner product (commonly called dot product),

• $F(\cdot)$ is the log-normalizer,

• $k(x)$ is the carrier measure.

The exponential family is said to be univariate if the dimension of the observation space $\mathcal{X}$ is
1D; otherwise it is said to be multivariate. The order $D$ of the family is the dimension of the natural
parameter space $\mathbb{P}_\Theta$. Part II reports the canonical decompositions of common exponential families.
Note that handling probability measures allows one to consider both probability densities and
probability mass functions in a common framework. Consider a measurable space $(\mathcal{X}, a, \mu)$ (with $a$
a $\sigma$-algebra) and $f$ a measurable map; the probability measure is defined as $P_\theta(dx) = p_F(x;\theta)\mu(dx)$.
Often, $a$ is the $\sigma$-algebra on the Borel sets, with $\mu$ the Lebesgue measure restricted to $\mathcal{X}$.

For example,

• Poisson distributions are univariate exponential distributions of order 1 (i.e., $\dim \mathcal{X} = 1$ and
$\dim \mathbb{P}_\Theta = 1$) with associated probability mass function:

    $\Pr(x = k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$,    (3)

for $k \in \mathbb{N}$.

The canonical exponential family decomposition yields:

- $t(x) = x$ is the sufficient statistic,

- $\theta = \log \lambda$ is the natural parameter,

- $F(\theta) = \exp \theta$ is the log-normalizer,

- $k(x) = -\log x!$ is the carrier measure.

• 1D Gaussian distributions are univariate distributions of order 2 (i.e., $\dim \mathcal{X} = 1$ and $\dim \mathbb{P}_\Theta = 2$),
characterized by two parameters $(\mu, \sigma)$ with associated density

    $p(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$    (4)

for $x \in \mathbb{R}$.

The canonical exponential family decomposition yields:

- $t(x) = (x, x^2)$ is the sufficient statistic,

- $\theta = (\theta_1, \theta_2) = \left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right)$ are the natural parameters,

- $F(\theta) = -\frac{\theta_1^2}{4\theta_2} + \frac{1}{2}\log\left(-\frac{\pi}{\theta_2}\right)$ is the log-normalizer,

- $k(x) = 0$ is the carrier measure.
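
The two decompositions above can be checked numerically. The following is a minimal sketch, assuming the natural parameters $\theta = \log\lambda$ for the Poisson family and $\theta = (\mu/\sigma^2, -1/(2\sigma^2))$ for the Gaussian family; the helper name `exp_family_density` and the test values are illustrative and not part of the original text.

```python
import math

def exp_family_density(x, theta, t, F, k):
    """Evaluate exp(<t(x), theta> - F(theta) + k(x)) for tuple-valued t(x) and theta."""
    inner = sum(ti * thi for ti, thi in zip(t(x), theta))
    return math.exp(inner - F(theta) + k(x))

# Poisson: t(x) = x, F(theta) = exp(theta), k(x) = -log(x!)
poisson = dict(
    t=lambda x: (x,),
    F=lambda th: math.exp(th[0]),
    k=lambda x: -math.lgamma(x + 1),
)

# 1D Gaussian: t(x) = (x, x^2), k(x) = 0
gaussian = dict(
    t=lambda x: (x, x * x),
    F=lambda th: -th[0] ** 2 / (4 * th[1]) + 0.5 * math.log(-math.pi / th[1]),
    k=lambda x: 0.0,
)

lam, mu, sigma = 3.5, 1.0, 2.0  # illustrative source parameters

# Densities evaluated through the canonical decomposition
p_poisson = exp_family_density(4, (math.log(lam),), **poisson)
p_gauss = exp_family_density(0.3, (mu / sigma**2, -1 / (2 * sigma**2)), **gaussian)

# The same densities written in their usual source-parameter form
direct_poisson = lam**4 * math.exp(-lam) / math.factorial(4)
direct_gauss = math.exp(-(0.3 - mu) ** 2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
```

Both pairs of values agree up to floating-point rounding, which is a quick sanity check when deriving the decomposition of a new family.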

Exponential families [Bro86] are characterized by their strictly convex and differentiable
functions $F$, called the log-normalizer (or cumulant/partition function). The sufficient statistics $t(x) : \mathcal{X} \mapsto \mathbb{P}_\Theta$
are said to be minimal if they are affinely independent. The carrier measure is usually the Lebesgue
measure (e.g., Gaussian, Rayleigh, etc.) or the counting measure (e.g., Poisson, binomial, etc.). Note that

    $F(\theta) = \log \int_{\mathcal{X}} \exp\left(\langle t(x), \theta \rangle + k(x)\right) dx$.    (5)

It is thus easy to build an exponential family: fix $k(x) = 0$ and choose for $t(x)$ an arbitrary
function on a given domain $x \in [x_{\min}, x_{\max}]$. For example, consider $t(x) = x$ for $x \in (-\infty, 1]$; then, for $\theta > 0$,

    $F(\theta) = \log \int_{-\infty}^{1} \exp(\theta x)\, dx = \log \left[\frac{\exp(\theta x)}{\theta}\right]_{-\infty}^{1} = \theta - \log\theta$.

By remapping $t(x)$ to $y$, we can consider without loss of generality regular exponential families,
where the dimension of the observation space matches that of the parameter space. The regular canonical
decomposition of the density simplifies to

    $p(y;\theta) = \exp\left(\langle y, \theta \rangle - F(\theta)\right)$    (6)

with respect to the base measure $h(x) = \exp(k(x))$.

Exponential families include many familiar distributions [Bro86], characterized by their log-normalizer
functions:

Gaussian or normal (generic, isotropic Gaussian, diagonal Gaussian, rectified Gaussian or Wald
distributions, log-normal), Poisson, Bernoulli, binomial, multinomial (trinomial, Hardy-Weinberg
distribution), Laplacian, Gamma (including the chi-squared), Beta, exponential, Wishart, Dirichlet,
Rayleigh, probability simplex, negative binomial distribution, Weibull, Fisher-von Mises, Pareto
distributions, skew logistic, hyperbolic secant, etc.

However, note that the uniform distribution does not belong to the exponential families.

The observation/sample space $\mathcal{X}$ can be of different types, such as integer (e.g., Poisson), scalar
(e.g., normal), categorical (e.g., multinomial), symmetric positive definite matrix (e.g., Wishart),
etc. For the latter case, the inner product of symmetric positive definite matrices is defined as the
matrix trace of the product, $\langle X, Y \rangle = \mathrm{Tr}(XY)$, that is, the sum of the eigenvalues of the matrix product
$X \times Y$.

The $k$-th order non-centered moment is defined as the expectation $E[X^k]$. The first moment is
called the mean $\mu$, and the second centered moment $E[(X - \mu)(X - \mu)^T]$ is called the variance-covariance
or dispersion matrix.

For exponential families, we have

    $E[t(X)] = \mu = \nabla F(\theta)$    (7)
    $E[(t(X) - \mu)(t(X) - \mu)^T] = \nabla^2 F(\theta)$    (8)
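
As a small numerical illustration of Equations (7) and (8), the sketch below differentiates the Bernoulli log-normalizer $F(\theta) = \log(1 + \exp\theta)$ by central finite differences and recovers the mean $p$ and the variance $p(1-p)$. The step size `h` and the helper names are assumptions made for this example only.

```python
import math

def F(theta):
    """Bernoulli log-normalizer F(theta) = log(1 + exp(theta))."""
    return math.log1p(math.exp(theta))

def derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

p = 0.3                             # illustrative source parameter
theta = math.log(p / (1 - p))       # natural parameter

mean_from_F = derivative(F, theta)         # approximates E[t(X)] = p
var_from_F = second_derivative(F, theta)   # approximates Var[t(X)] = p (1 - p)
```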
Exponential families (including uniparameter families): Binomial, Bernoulli, Multinomial, Dirichlet,
Weibull, Exponential, Rayleigh, Gaussian.

Non-exponential families: Uniform, Cauchy, Lévy skew $\alpha$-stable.

Table 1: Partial taxonomy of statistical distributions.



Notation $\nabla^2 F$ denotes the Hessian of the log-normalizer. It is a positive definite matrix since $F$
is a strictly convex and differentiable function. In fact, exponential families have finite moments of all orders,
and $F$ is $C^\infty$ differentiable. Thus Cauchy distributions provably do not form an exponential family,
since they have no defined mean. Another widely used family of distributions that is not an exponential
family is the family of Lévy skew $\alpha$-stable distributions.

In practice, we need to consider minimal exponential families with the sufficient statistics $t(x)$.
Since there are several ways to decompose a density/probability mass according to the terms
$\theta$, $t(x)$, $F(\theta)$ and $k(x)$, we adopt the following basic conventions:

The sufficient statistics $t(x)$ should be elementary functions (often polynomial functions) with
leading unit coefficient. The carrier measure should not have a constant term, so that we prefer to
absorb the constant in the log-normalizer. Finally, we denote by $\Lambda$ the traditional source parameters,
and by $\mathbb{P}_\Lambda$ the source parameter space (canonical parameter space).

The natural parameter space

    $\mathbb{P}_\Theta = \{\Theta \mid |F(\Theta)| < +\infty\}$    (9)

is necessarily an open convex set. Although the product of exponential families is an (unnormalized)
exponential family, the mixture of exponential families is not an exponential family.
The moment generating function of an exponential family $\{p_F(x;\theta) \mid \theta \in \mathbb{P}_\Theta\}$ is:

    $m_\theta(x) = \exp\left(F(\theta + x) - F(\theta)\right)$    (10)

Function $F$ is thus sometimes called the logarithmic moment generating function [AN00] (p. 69).
The cumulant generating function is defined as

    $K_\theta(x) = \log m_\theta(x) = F(\theta + x) - F(\theta)$    (11)

Exponential families can be generated from Laplace transforms [Ban07]. Let $H(x)$ be a bounded
non-negative measure with density $h(x) = \exp k(x)$. Consider the Laplace transform:

    $L(\theta) = \int \exp(\langle x, \theta \rangle)\, h(x)\, dx$    (12)

Since $L(\theta) > 0$, we can rewrite the former equation to show that

    $p(x;\theta) = \exp\left(\langle x, \theta \rangle - \log L(\theta)\right)$    (13)

is a probability density with respect to the measure $H(x)$. That is, the log-normalizer of the
exponential family is the logarithm of the Laplace transform of the measure $H(x)$.

1.3 Dual parameterizations: Natural and expectation parameters 

A fundamental duality of convex analysis is the Legendre-Fenchel transform. Informally, it says
that strictly convex and differentiable functions come in pairs.

For the log-normalizer $F$, consider its Legendre dual $G = F^*$ defined by the "slope" transform

    $G(\eta) = \sup_{\theta \in \mathbb{P}_\Theta} \langle \theta, \eta \rangle - F(\theta)$.    (14)
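[Figure 2 diagram: the original (source) parameters $\lambda \in \Lambda$ map to the dual parameterization of the
exponential family, namely the natural parameters $\theta \in \Theta$ and the expectation parameters $\eta \in H$, which
are related by the Legendre transform $(\Theta, F) \leftrightarrow (H, F^*)$ with $\eta = \nabla_\theta F(\theta)$ and $\theta = \nabla_\eta F^*(\eta)$.]

Figure 2: Dual parameterizations of exponential families from Legendre transformation.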

The extremum is obtained for $\eta = \nabla F(\theta)$. $\eta$ is called the moment parameter, since for
exponential families we have $\eta = E[t(X)] = \mu = \nabla F(\theta)$. Gradients of conjugate pairs are reciprocal
inverses of each other, $\nabla F^* = (\nabla F)^{-1}$, and therefore $F^* = \int \nabla F^* = \int (\nabla F)^{-1}$.

Thus to describe a member of an exponential family, we can use either the source parameters or the
canonical natural/expectation parameters. Figure 2 illustrates the conversion procedures.
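
To make the Legendre transform concrete, here is a minimal sketch that evaluates $F^*(\eta) = \sup_\theta\, \theta\eta - F(\theta)$ numerically for the Poisson log-normalizer $F(\theta) = \exp\theta$ and compares it with the closed form $\eta\log\eta - \eta$. The ternary search, its bounds and the function name `legendre_conjugate` are choices made for this illustration, not a method prescribed by the text.

```python
import math

def legendre_conjugate(F, eta, lo=-30.0, hi=30.0, iters=200):
    """Maximise theta * eta - F(theta) over [lo, hi] by ternary search.

    The objective is concave in theta because F is convex.
    Returns the maximal value and the maximising theta.
    """
    g = lambda th: th * eta - F(th)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if g(m1) < g(m2):
            lo = m1
        else:
            hi = m2
    theta = 0.5 * (lo + hi)
    return g(theta), theta

eta = 2.5                                    # illustrative expectation parameter
G_numeric, theta_star = legendre_conjugate(math.exp, eta)
G_closed = eta * math.log(eta) - eta         # closed-form conjugate for the Poisson family
```

The maximiser `theta_star` approximates $\log\eta$, which is the inverse gradient $(\nabla F)^{-1}(\eta)$ for this family.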

1.4 Geometry of exponential families: Riemannian and information geometries 

A family of parametric distributions $\{p(x;\theta)\}$ (exponential or not) may be thought of as a smooth
manifold that can be studied using the framework of differential geometry [Lau87]. We review two
main types of geometries: (1) the Riemannian geometry defined by a bilinear tensor with an induced
Levi-Civita connection, and (2) the non-metric geometry induced by a symmetric affine connection.

Čencov^2 proved [Cen72] (see also [Leb05] and [GS01] for an equivalent in quantum information
geometry) that the only Riemannian metric that "makes sense" for statistical manifolds is the
Fisher information metric:



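    $I(\theta) = \left[ \int \frac{\partial \log p(x;\theta)}{\partial \theta_i} \frac{\partial \log p(x;\theta)}{\partial \theta_j}\, p(x;\theta)\, dx \right]_{ij} = [g_{ij}]$.    (15)

The infinitesimal length element is given by

    $ds^2 = \sum_{i=1}^{d} \sum_{j=1}^{d} g_{ij}\, d\theta_i\, d\theta_j$.    (16)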



Čencov proved that for a non-singular transformation of the parameters $\lambda = f(\theta)$, the
information matrix

    $I(\lambda) = \left[\frac{\partial \theta_i}{\partial \lambda_j}\right]^T I(\theta) \left[\frac{\partial \theta_i}{\partial \lambda_j}\right]$    (17)

is such that $ds^2(\lambda) = ds^2(\theta)$. Equipped with the tensor $I(\theta)$, the metric distance between two
distributions on a statistical manifold can be computed from the geodesic length (i.e., shortest
path):

2 Also written as Chentsov.
    $D(p(x;\theta_1), p(x;\theta_2)) = \min_{\theta(t) \,\mid\, \theta(0)=\theta_1,\ \theta(1)=\theta_2} \int_0^1 \sqrt{\left(\frac{d\theta}{dt}\right)^T I(\theta) \left(\frac{d\theta}{dt}\right)}\; dt$    (18)



Rao's geodesic distance is invariant under non-singular transformations. The multinomial
Fisher-Rao-Riemannian geometry yields a spherical geometry, and the normal Fisher-Rao-Riemannian
geometry yields a hyperbolic geometry [KV97, CSS05]. Indeed, the Fisher information matrix for
univariate normal distributions, expressed in the $(\mu, \sigma)$ coordinates, is

    $I(\theta) = \frac{1}{\sigma^2} \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$    (19)



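The Fisher information matrix can be interpreted as the Hessian of the Shannon entropy:

    $[I(\theta)]_{ij} = -E\left[\frac{\partial^2 \log p(x;\theta)}{\partial \theta_i \partial \theta_j}\right] = \frac{\partial^2 H(p)}{\partial \theta_i \partial \theta_j}$    (20)

with $H(p) = -\int p(x;\theta) \log p(x;\theta)\, dx$.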

For an exponential family, the Kullback-Leibler divergence is a Bregman divergence on the
natural parameters. Using Taylor approximation with exact remainder, we get $KL(\theta \| \theta + d\theta) = \frac{1}{2} d\theta^T \nabla^2 F(\theta)\, d\theta$.
Moreover, the infinitesimal Rao distance is $\sqrt{d\theta^T I(\theta)\, d\theta}$ for $I(\theta) = \nabla^2 F(\theta)$. We
deduce that $D(\theta, \theta + d\theta) = \sqrt{2\, KL(\theta \| \theta + d\theta)}$.

The inner product^3 of two vectors $x$ and $y$ in the tangent space at point $p$ is

    $\langle x, y \rangle_p = x^T g_p\, y$.    (21)

The length of a vector $v$ in the tangent space $T_p$ at $p$ is defined by $\|v\|_p = \sqrt{\langle v, v \rangle_p}$.

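For exponential families, the logarithm of the density is concave (since $F$ is convex), and we
have

    $I(\theta) = \left[\frac{\partial \eta}{\partial \theta}\right] = \nabla^2 F(\theta) = I^{-1}(\eta) = \left[\frac{\partial \theta}{\partial \eta}\right]^{-1}$    (22)

That is, the Fisher information is the Hessian of the log-normalizer $\nabla^2 F(\theta)$.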

To a given Riemannian manifold $\mathcal{M}$ with tensor $G$, we may associate^4 a probability measure as
follows: let $p(\theta) = \frac{1}{V}\sqrt{\det G(\theta)}$ with overall volume $V = \int_{\theta \in \Theta} \sqrt{\det G(\theta)}\, d\theta$. These distributions are
built from the infinitesimal volume element and bear the name of Jeffreys priors. They are commonly
used in Bayesian estimation [AN00, KV97].

Furthermore, in an exponential family manifold, the geometry is flat and $\theta$/$\eta$ are dual coordinate
systems.

Amari [AN00] focused on a pair of dual affine mixture/exponential connections $\nabla^{(m)}$ and $\nabla^{(e)}$
induced by a contrast function $F$ (also called potential function).

A connection $\nabla$ yields a function $\Pi$ that maps vectors between any pair of tangent spaces $T_p$
and $T_q$. An affine connection is defined by $d^3$ coefficients. Amari thoroughly investigated the
$\alpha$-connections and showed that $\nabla^{(0)}$, the metric Levi-Civita connection, is obtained from the dual
$\nabla^{(\alpha)}$ and $\nabla^{(-\alpha)}$ $\alpha$-connections:

    $\nabla^{(0)} = \frac{\nabla^{(\alpha)} + \nabla^{(-\alpha)}}{2}$    (23)

In particular, the $\nabla^{(1)}$ connection is called the exponential connection and the $\nabla^{(-1)}$ connection
is called the mixture connection. The mixture/exponential geodesics for two distributions $p(x)$ and
$q(x)$ induced by these dual connections are defined by

    $\gamma_\lambda(x) = (1 - \lambda)\, p(x) + \lambda\, q(x)$    (24)

    $\log \gamma_\lambda(x) = (1 - \lambda) \log p(x) + \lambda \log q(x) - \log Z_\lambda$,    (25)

with $Z_\lambda$ the normalization coefficient $Z_\lambda = \int p^{(1-\lambda)}(x)\, q^{\lambda}(x)\, dx$. The exponential geodesic can
be equivalently rewritten as:

    $\gamma_\lambda(x) = \frac{p(x)^{(1-\lambda)}\, q(x)^{\lambda}}{Z_\lambda}$    (26)

The dual connections are also called conjugate connections.

3 Technically speaking, it is a bilinear symmetric positive definite operator: a tensor $[T_p]^0_2$ of covariant degree 2
and of contravariant degree 0, $\langle \cdot, \cdot \rangle_p : T_p \times T_p \to \mathbb{R}$.

4 Or view/interpret the manifold as a statistical manifold.

The canonical divergence on exponential family distributions of these dually flat statistical
manifolds is shown to be:

    $D(p(x;\theta_1) \| p(x;\theta_2)) = F(\theta_1) + F^*(\eta_2) - \langle \theta_1, \eta_2 \rangle$    (27)

This canonical divergence can be rewritten as a Bregman divergence on the natural parameter
space:

    $D(p(x;\theta_1) \| p(x;\theta_2)) = B_F(\theta_1 \| \theta_2) = F(\theta_1) - F(\theta_2) - \langle \theta_1 - \theta_2, \nabla F(\theta_2) \rangle$    (28)

Furthermore, we have $B_F(\theta_1 \| \theta_2) = B_{F^*}(\eta_2 \| \eta_1)$, where $F^*$ is the Legendre conjugate of $F$, and
$\eta_i = \nabla F(\theta_i)$.

The Kullback-Leibler divergence between two members of the same exponential family is equivalent
to the Bregman divergence of the associated log-normalizer on swapped natural parameters:

    $KL(p(x;\theta_1) \| p(x;\theta_2)) = \int p(x;\theta_1) \log \frac{p(x;\theta_1)}{p(x;\theta_2)}\, dx = B_F(\theta_2 \| \theta_1)$    (29)
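
Equation (29) can be verified numerically on the Poisson family, for which $F(\theta) = \exp\theta$ and $\theta = \log\lambda$. The sketch below compares a direct summation of the Kullback-Leibler divergence with the Bregman divergence on swapped natural parameters; the truncation at 200 terms and the parameter values are assumptions made for this example.

```python
import math

F = math.exp        # Poisson log-normalizer F(theta) = exp(theta)
grad_F = math.exp   # its gradient

def bregman(F, grad_F, a, b):
    """Scalar Bregman divergence B_F(a || b) = F(a) - F(b) - (a - b) F'(b)."""
    return F(a) - F(b) - (a - b) * grad_F(b)

def poisson_logpmf(x, lam):
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def kl_by_summation(lam1, lam2, terms=200):
    """Truncated sum of p1(x) * log(p1(x) / p2(x)) over x = 0, ..., terms - 1."""
    total = 0.0
    for x in range(terms):
        lp1 = poisson_logpmf(x, lam1)
        total += math.exp(lp1) * (lp1 - poisson_logpmf(x, lam2))
    return total

lam1, lam2 = 2.0, 5.0
theta1, theta2 = math.log(lam1), math.log(lam2)

kl_direct = kl_by_summation(lam1, lam2)
kl_bregman = bregman(F, grad_F, theta2, theta1)   # note the swapped arguments
```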

Note that the Bregman divergence may also be interpreted as a generalized relative entropy.
Indeed,

    $KL(p \| q) = H^\times(p \| q) - H(p)$,    (30)

with

    $H(p) = \int p(x) \log \frac{1}{p(x)}\, dx$    (31)

the Shannon entropy, and the cross-entropy $H^\times(p \| q) \geq H(p)$:

    $H^\times(p \| q) = \int p(x) \log \frac{1}{q(x)}\, dx$    (32)

Indeed, let $H_F(p) = -F(p)$ denote the generalized entropy, and

    $H_F^\times(p \| q) = -F(q) - \langle p - q, \nabla F(q) \rangle$    (33)

the generalized cross-entropy. The Bregman divergence can be considered as a generalized relative
entropy:

    $B_F(p \| q) = H_F^\times(p \| q) - H_F(p)$    (34)

Bregman divergences are the canonical divergences of dually flat Riemannian manifolds [AN00].
Bregman divergences naturally extend the quadratic distances (squared Euclidean and squared
Mahalanobis distances), and generalize the notions of orthogonality, projection, and Pythagoras'
theorem.

The Bregman projection $q^\perp$ of a point $q$ onto a subspace $W$ is the unique minimizer of $B_F(w \| q)$
for $w \in W$:

    $q^\perp = \arg\min_{w \in W} B_F(w \| q)$    (35)

This is a right-side projection. The left-side projection is simply a right-side projection for the
Legendre conjugate generator $F^*$.

The following remarkable 3-point property holds for any Bregman divergence:

    $B_F(p \| q) + B_F(q \| r) = B_F(p \| r) + \langle p - q, \nabla F(r) - \nabla F(q) \rangle$.    (36)

This formula may be interpreted as the law of generalized cosines.
The proof follows by mathematical rewriting:

    $B_F(p \| q) + B_F(q \| r) = F(p) - F(q) - \langle p - q, \nabla F(q) \rangle + F(q) - F(r) - \langle q - r, \nabla F(r) \rangle$

    $= B_F(p \| r) + \langle p - r, \nabla F(r) \rangle + \langle r, \nabla F(r) \rangle - \langle p, \nabla F(q) \rangle + \langle q, \nabla F(q) - \nabla F(r) \rangle$
    $= B_F(p \| r) + \langle p - q, \nabla F(r) - \nabla F(q) \rangle$
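
The 3-point property is easy to check numerically. The following minimal sketch uses the generator $F(x) = \sum_i x_i \log x_i - x_i$ (an illustrative choice, with arbitrary positive test points) and evaluates both sides of Equation (36).

```python
import math

def F(x):
    """Illustrative strictly convex generator on positive vectors."""
    return sum(xi * math.log(xi) - xi for xi in x)

def grad_F(x):
    return [math.log(xi) for xi in x]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sub(u, v):
    return [ui - vi for ui, vi in zip(u, v)]

def bregman(p, q):
    """B_F(p || q) = F(p) - F(q) - <p - q, grad F(q)>."""
    return F(p) - F(q) - dot(sub(p, q), grad_F(q))

# Arbitrary positive test points
p = [0.2, 0.5, 1.3]
q = [0.7, 0.4, 0.9]
r = [1.1, 0.3, 0.6]

lhs = bregman(p, q) + bregman(q, r)
rhs = bregman(p, r) + dot(sub(p, q), sub(grad_F(r), grad_F(q)))
```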



Geodesic (pq) is orthogonal to dual geodesic (rq). 

Indeed, choosing $q = r^\perp$ (the Bregman projection of $r$ onto a subspace containing $p$), we end up
with a generalized Pythagoras' theorem:

    $B_F(p \| q) + B_F(q \| r) = B_F(p \| r)$    (37)

That is, triangle $p, q, r$ is a "right-angle" triangle. The $\nabla$-geodesic $(pq)$ is orthogonal to the dual
$\nabla^*$-geodesic $(qr)$. The inner product $\langle p - q, \nabla F(r) - \nabla F(q) \rangle$ vanishes. Note that the notion
of orthogonality is not commutative. Euclidean geometry is the special case of self-dual flat spaces
with commutative orthogonality, obtained for $F(x) = \frac{1}{2} x^T x$.

Banerjee et al. [BMDG05] formally proved the duality between exponential families and
Bregman divergences for regular exponential families using the Legendre transform:

    $\log p_F(x;\theta) = \langle t(x), \theta \rangle - F(\theta) + k(x) = -B_{F^*}(t(x) \| \nabla F(\theta)) + \underbrace{F^*(t(x)) + k(x)}_{=k_{F^*}(x)}$    (38)

This duality proves key to designing an expectation-maximization algorithm using soft Bregman
clustering. The proof further reveals that

    $\mathcal{X}_F \subset \mathrm{dom}\, F^*$.    (39)

That is, the space of observations for the exponential families with log-normalizer $F$ is included in
the domain of the Legendre conjugate function.

For two very close points $p$ and $q \to p$, the Kullback-Leibler divergence is related to the Fisher
metric by

    $KL(p \| q) \approx \frac{1}{2} D^2(p, q)$    (40)

For exponential families, we can thus easily recover the Fisher information metric from the
corresponding derivatives of the Bregman divergence.

However, it is difficult to compute Rao's distance $\int \sqrt{d\theta^T \nabla^2 F(\theta)\, d\theta}$ since it requires computing
the anti-derivative of $\sqrt{\nabla^2 F(\theta)}$. See, for example, the work [RO03], which numerically computes an
approximation of Rao's distance for gamma distributions.

1.5 Statistical inference 

1.5.1 Maximum likelihood estimator 

Given $n$ i.i.d. observations $x_1, ..., x_n$ sampled from a given exponential family $p_F(x;\theta)$, the
maximum likelihood estimator recovers the parameter of the distribution by maximizing

    $\hat{\theta} = \arg\max_\theta \prod_{i=1}^{n} p_F(x_i;\theta)$.    (41)

It follows that the maximum likelihood estimator can be obtained from the center of mass of
the sufficient statistics (the observed point):

    $\hat{\theta} = \nabla F^*\left(\frac{1}{n} \sum_{i=1}^{n} t(x_i)\right)$    (42)
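
As an illustration of Equation (42), the sketch below computes the maximum likelihood estimate for the univariate Gaussian family with $t(x) = (x, x^2)$: it averages the sufficient statistics to get the observed point $\hat{\eta}$, then maps it to natural parameters with $\nabla F^*(\eta) = \left(\frac{\eta_1}{\eta_2 - \eta_1^2}, -\frac{1}{2(\eta_2 - \eta_1^2)}\right)$. The function name and the sample values are hypothetical.

```python
def gaussian_mle(samples):
    """MLE of a univariate Gaussian from the mean of its sufficient statistics."""
    n = len(samples)
    # Observed point: average of t(x) = (x, x^2)
    eta1 = sum(samples) / n
    eta2 = sum(x * x for x in samples) / n
    variance = eta2 - eta1 ** 2
    # Natural parameters via the gradient of the Legendre conjugate
    theta = (eta1 / variance, -1.0 / (2.0 * variance))
    # Source parameters (mu, sigma^2), for convenience
    source = (eta1, variance)
    return (eta1, eta2), theta, source

eta_hat, theta_hat, lambda_hat = gaussian_mle([1.0, 2.0, 3.0, 4.0])
```

For these four samples the estimate is the usual sample mean together with the (biased) sample variance, as expected from the classical Gaussian MLE.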

The variance of any unbiased estimator cannot beat the Cramér-Rao bound:

    $\mathrm{var}(\hat{\theta}) \geq I^{-1}(\theta)$    (43)

1.5.2 Bayesian inference and conjugate priors 

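The family of conjugate priors of an exponential family with likelihood function:

    $\exp\left(\left\langle \theta, \sum_{i=1}^{n} t(x_i) \right\rangle - n F(\theta) + \sum_{i=1}^{n} k(x_i)\right)$    (44)

is

    $\exp\left(\langle \theta, g \rangle - v F(\theta)\right)$.    (45)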
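
Multiplying the prior (45) by the likelihood (44) gives a posterior of the same form, with hyperparameters $g + \sum_i t(x_i)$ and $v + n$. The short sketch below illustrates this update for a scalar sufficient statistic; the function name, the prior values and the observations are hypothetical.

```python
def posterior_hyperparameters(g, v, observations, t):
    """Conjugate update of the prior exp(<theta, g> - v F(theta)) for scalar t(x)."""
    g_post = g + sum(t(x) for x in observations)   # add the summed sufficient statistics
    v_post = v + len(observations)                 # add the number of observations
    return g_post, v_post

# Poisson example with t(x) = x and an illustrative prior (g, v) = (2, 1)
g_post, v_post = posterior_hyperparameters(2.0, 1.0, [3, 5, 4], lambda x: x)
posterior_guess = g_post / v_post   # ratio g/v, read here as a guess of E[t(X)]
```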
1.5.3 Statistical distances for exponential families

Exponential families admit simple expressions for the entropies and related divergences [NN11]. We
summarize below the formulas, starting with the Rényi $D^R_\alpha(p : q)$ and Tsallis $D^T_\alpha(p : q)$ divergences
(relative entropies):

    $D^R_\alpha(p : q) = \frac{J_{F,\alpha}(\theta : \theta')}{1 - \alpha}$,    (46)

    $D^T_\alpha(p : q) = \frac{e^{-J_{F,\alpha}(\theta : \theta')} - 1}{\alpha - 1}$,    (47)

    $KL(p : q) = \lim_{\alpha \to 1} D^R_\alpha(p : q) = \lim_{\alpha \to 1} D^T_\alpha(p : q) = B_F(\theta' : \theta)$,    (48)

where

    $J_{F,\alpha}(\theta : \theta') = \alpha F(\theta) + (1 - \alpha) F(\theta') - F(\alpha\theta + (1 - \alpha)\theta') = J_{F,1-\alpha}(\theta' : \theta)$    (49)

is the skew Jensen divergence. Since the Rényi divergence for $\alpha = \frac{1}{2}$ is related to the Bhattacharyya
coefficient $B(p, q)$ and the Hellinger distance $H(p, q)$, this also yields closed-form expressions for
members of the same exponential family:

    $B(p, q) = e^{-J_{F,\frac{1}{2}}(\theta : \theta')}$,    (50)

    $H(p, q) = \sqrt{1 - e^{-J_{F,\frac{1}{2}}(\theta : \theta')}}$.    (51)
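
The closed form for the Bhattacharyya coefficient can be checked on the Poisson family ($F(\theta) = \exp\theta$, $\theta = \log\lambda$), where the skew Jensen divergence at $\alpha = \frac{1}{2}$ reduces to $\frac{1}{2}(\sqrt{\lambda_1} - \sqrt{\lambda_2})^2$. The sketch below compares it with a truncated direct summation of $\sum_x \sqrt{p(x) q(x)}$; the truncation length and the parameter values are assumptions of this example.

```python
import math

def skew_jensen(F, theta1, theta2, alpha):
    """J_{F,alpha}(theta1 : theta2) = alpha F(theta1) + (1 - alpha) F(theta2) - F(mix)."""
    mix = alpha * theta1 + (1 - alpha) * theta2
    return alpha * F(theta1) + (1 - alpha) * F(theta2) - F(mix)

def poisson_logpmf(x, lam):
    return x * math.log(lam) - lam - math.lgamma(x + 1)

lam1, lam2 = 2.0, 5.0
theta1, theta2 = math.log(lam1), math.log(lam2)

# Closed form through the log-normalizer only
bc_closed = math.exp(-skew_jensen(math.exp, theta1, theta2, 0.5))

# Direct (truncated) summation of sqrt(p(x) q(x))
bc_direct = sum(
    math.exp(0.5 * (poisson_logpmf(x, lam1) + poisson_logpmf(x, lam2)))
    for x in range(200)
)

hellinger = math.sqrt(1.0 - bc_closed)
```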

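Rényi $H^R_\alpha(p_F(x;\theta))$ and Tsallis $H^T_\alpha(p_F(x;\theta))$ entropies, including the Shannon entropy $H(p_F(x;\theta))$
in the limit case, can be expressed respectively as

    $H^R_\alpha(p_F(x;\theta)) = \frac{1}{1 - \alpha}\left(F(\alpha\theta) - \alpha F(\theta) + \log E_p\left[e^{(\alpha - 1) k(x)}\right]\right)$    (52)

    $H^T_\alpha(p_F(x;\theta)) = \frac{1}{1 - \alpha}\left(e^{F(\alpha\theta) - \alpha F(\theta)}\, E_p\left[e^{(\alpha - 1) k(x)}\right] - 1\right)$    (53)

    $H(p_F(x;\theta)) = F(\theta) - \langle \theta, \nabla F(\theta) \rangle - E_p[k(x)]$    (54)

The Shannon cross-entropy $H^\times(p_F(x;\theta) : p_F(x;\theta'))$ is given by

    $H^\times(p_F(x;\theta) : p_F(x;\theta')) = F(\theta') - \langle \theta', \nabla F(\theta) \rangle - E_p[k(x)]$    (55)

Thus these entropies admit closed-form formulas whenever the carrier measure is
zero ($k(x) = 0$):

    $H^R_\alpha(p_F(x;\theta)) = \frac{1}{1 - \alpha}\left(F(\alpha\theta) - \alpha F(\theta)\right)$    (56)

    $H^T_\alpha(p_F(x;\theta)) = \frac{1}{1 - \alpha}\left(e^{F(\alpha\theta) - \alpha F(\theta)} - 1\right)$    (57)

    $H(p_F(x;\theta)) = F(\theta) - \langle \theta, \nabla F(\theta) \rangle$    (58)

Note that statistical distances are invariant under a monotonic function and under reparameterization
of the sufficient statistics, and thus are expressed only using the log-normalizer $F$ and the natural parameter $\theta$.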
1.5.4 Mixture of exponential families

Consider a mixture of exponential families of $k$ components

    $p_F(x;\theta_1, ..., \theta_k) = \sum_{i=1}^{k} w_i\, p_F(x;\theta_i)$,    (59)

with $\sum_{i=1}^{k} w_i = 1$ and all $w_i \geq 0$. Mixtures of exponential families include the Gaussian mixture
models (GMMs), mixtures of Gamma distributions, mixtures of zero-mean Laplacians, etc. To get a
random sample from a mixture, we first draw a random number uniformly in $[0, 1]$ to select the component,
and then draw a random sample from the selected exponential family member.
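
The two-stage sampling procedure can be sketched as follows. This is a minimal illustration using Python's standard `random` module, with a hypothetical two-component Gaussian mixture; the function name `sample_mixture` and the component parameters are not from the original text.

```python
import random
from bisect import bisect

def sample_mixture(weights, samplers, rng):
    """Draw one sample: pick a component from a uniform number, then sample from it."""
    u = rng.random()                       # uniform number in [0, 1)
    cumulative, acc = [], 0.0
    for w in weights:                      # cumulative weights w_1, w_1 + w_2, ...
        acc += w
        cumulative.append(acc)
    # Index of the first cumulative weight exceeding u (clamped against rounding)
    index = min(bisect(cumulative, u), len(weights) - 1)
    return index, samplers[index](rng)

# Hypothetical mixture: 0.3 * N(-2, 1) + 0.7 * N(3, 0.5^2)
weights = [0.3, 0.7]
samplers = [
    lambda rng: rng.gauss(-2.0, 1.0),
    lambda rng: rng.gauss(3.0, 0.5),
]

rng = random.Random(0)
component, value = sample_mixture(weights, samplers, rng)
```

Over many draws, the fraction of samples coming from each component approaches the corresponding weight.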

To fit a mixture model to a set of $n$ independently and identically distributed (i.i.d.) observations
$x_1, ..., x_n$, we use the general expectation-maximization procedure [DLR77].

Initialization. We first compute a Bregman hard clustering on the $n$ observations $x_1, ..., x_n$ to get
a collection of $k$ clusters. Let $n_i$ denote the number of points in the $i$th cluster, and let $x_{i(j)}$
denote the points of that cluster for $j = 1, ..., n_i$. For each cluster, we initialize the weight $w_i = \frac{n_i}{n}$ to
the proportion of points inside the cluster, and estimate the expectation/natural parameters
$\eta_i$/$\theta_i$ using the observed point. That is, $\eta_i = \frac{1}{n_i} \sum_{j=1}^{n_i} t(x_{i(j)})$ and, in the dual coordinate
system, $\theta_i = (\nabla F)^{-1}\left(\frac{1}{n_i} \sum_{j=1}^{n_i} t(x_{i(j)})\right)$.

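Expectation step.

    $p(i,j) = \frac{w_j \exp\left(-D_G(t(x_i) \| \eta_j)\right) \exp(k(x_i))}{\sum_{l=1}^{k} w_l \exp\left(-D_G(t(x_i) \| \eta_l)\right) \exp(k(x_i))}$    (60)

with the Bregman divergence for the Legendre conjugate $G = F^*$ defined as:

    $D_G(p \| q) = G(p) - G(q) - \langle p - q, \nabla G(q) \rangle$    (61)

We simplify^5 the terms $G(t(x_i))$ in the numerator/denominator to get

    $p(i,j) = \frac{w_j \exp\left(G(\eta_j) + \langle t(x_i) - \eta_j, \nabla G(\eta_j) \rangle\right)}{\sum_{l=1}^{k} w_l \exp\left(G(\eta_l) + \langle t(x_i) - \eta_l, \nabla G(\eta_l) \rangle\right)}$    (62)

Maximization step.

    $w_j = \frac{1}{n} \sum_{i=1}^{n} p(i,j)$    (63)

    $\eta_j = \frac{\sum_{i=1}^{n} p(i,j)\, t(x_i)}{\sum_{i=1}^{n} p(i,j)}$    (64)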
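
The expectation and maximization steps can be sketched for a mixture of Poisson distributions, for which $t(x) = x$, $G(\eta) = \eta\log\eta - \eta$ and $\nabla G(\eta) = \log\eta$. This is a minimal illustration under stated assumptions: the data, the initial weights and expectation parameters, and the function names are hypothetical, and the initialization is given by hand rather than by a Bregman hard clustering.

```python
import math

def G(eta):
    """Legendre conjugate of the Poisson log-normalizer (up to a constant)."""
    return eta * math.log(eta) - eta

def grad_G(eta):
    return math.log(eta)

def em_step(data, weights, etas):
    """One soft-clustering iteration in expectation parameters (t(x) = x)."""
    k = len(weights)
    posteriors = []
    for x in data:
        # Expectation step: log of w_j * exp(G(eta_j) + (t(x) - eta_j) * grad G(eta_j))
        scores = [
            math.log(weights[j]) + G(etas[j]) + (x - etas[j]) * grad_G(etas[j])
            for j in range(k)
        ]
        top = max(scores)                      # subtract the max for numerical stability
        unnorm = [math.exp(s - top) for s in scores]
        total = sum(unnorm)
        posteriors.append([u / total for u in unnorm])
    # Maximization step: update weights and expectation parameters
    new_weights = [sum(p[j] for p in posteriors) / len(data) for j in range(k)]
    new_etas = [
        sum(p[j] * x for p, x in zip(posteriors, data)) / sum(p[j] for p in posteriors)
        for j in range(k)
    ]
    return new_weights, new_etas

def fit(data, weights, etas, iterations=50):
    for _ in range(iterations):
        weights, etas = em_step(data, weights, etas)
    return weights, etas

# Hypothetical counts drawn around 2 and around 20
data = [1, 2, 3, 2, 1, 3, 2, 2, 19, 21, 20, 22, 18, 20]
weights, etas = fit(data, [0.5, 0.5], [1.0, 10.0])
```

Working in the expectation parameters keeps the update of $\eta_j$ a plain weighted average of sufficient statistics; the natural parameters can be recovered afterwards as $\theta_j = \nabla G(\eta_j)$.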

Given two mixtures of exponential families $f = \sum_{i=1}^{k} w_i\, p_F(x;\theta_i)$ and $g = \sum_{j=1}^{k'} w'_j\, p_F(x;\theta'_j)$, we can bound
the relative entropy of these distributions using Jensen's inequality on the convex Kullback-Leibler
divergence as follows:

    $KL(f \| g) \leq \sum_{i=1}^{k} \sum_{j=1}^{k'} w_i w'_j\, KL(p_F(x;\theta_i) \| p_F(x;\theta'_j)) = \sum_{i=1}^{k} \sum_{j=1}^{k'} w_i w'_j\, B_F(\theta'_j \| \theta_i)$    (65)

This bound is far too crude to be useful in practice.

5 This step is crucial since $G(t(x))$ may not be defined. Consider for example the univariate Gaussian distribution.
We have $G(\eta) = -\frac{1}{2}\log(\eta_2 - \eta_1^2)$ with $t_1(x) = x$ and $t_2(x) = x^2$. Thus $G(t(x)) = -\frac{1}{2}\log(x^2 - x^2)$ is not defined.

We may consider approximating the relative entropy by matching components of the mixture, 
and get the following approximation: 



In practice, the unscented transform yields a better approximation to the Kullback-Leibler
divergence. We consider $2dk$ sigma points as follows: for each component of the first mixture $f$,
we decompose the Hessian $\Sigma_\theta = \nabla^2 F(\theta)$ into $2d$ points $\pm\left(\sqrt{d\Sigma_\theta}\right)_k$, where $\left(\sqrt{d\Sigma_\theta}\right)_k$ denotes the $k$-th
column of the matrix square root of $d\Sigma_\theta$. Then the Kullback-Leibler divergence is approximated at
the $2dk$ sigma points by



i=l j=i 

where g is the probability density of the second mixture. 

Although it is widely known that GMMs can approximate arbitrarily finely any smooth probability
density function, it is in fact possible to model any smooth density with a single member of an
exponential family by defining the scalar product $\langle \cdot, \cdot \rangle$ in a reproducing kernel Hilbert space
(RKHS) $\mathcal{H}$ [CS06]. Thus exponential families in RKHSs are universal density estimators [ASH04].

Software library 

The jMEF is a Java library implementing the hard/soft/hierarchical techniques for exponential 
families with respect to sided and symmetrized Bregman divergences [NN09]: 




(66) 




(67) 



http://www.lix.polytechnique.fr/~nielsen/MEF/



Part II 

Exponential families: Flash cards 



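2 Univariate Gaussian distribution

PDF expression: $f(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ for $x \in \mathbb{R}$
Kullback-Leibler divergence: $D_{KL}(f_P \| f_Q) = \frac{1}{2}\left(2\log\frac{\sigma_Q}{\sigma_P} + \frac{\sigma_P^2}{\sigma_Q^2} + \frac{(\mu_Q - \mu_P)^2}{\sigma_Q^2} - 1\right)$
MLE: $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2$
Source parameters: $\Lambda = (\mu, \sigma^2) \in \mathbb{R} \times \mathbb{R}^+$
Natural parameters: $\Theta = (\theta_1, \theta_2) \in \mathbb{R} \times \mathbb{R}^-$
Expectation parameters: $H = (\eta_1, \eta_2) \in \mathbb{R} \times \mathbb{R}^+$
$\Lambda \to \Theta$: $\Theta = \left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right)$
$\Theta \to \Lambda$: $\Lambda = \left(-\frac{\theta_1}{2\theta_2}, -\frac{1}{2\theta_2}\right)$
$\Lambda \to H$: $H = (\mu, \sigma^2 + \mu^2)$
$H \to \Lambda$: $\Lambda = (\eta_1, \eta_2 - \eta_1^2)$
$\Theta \to H$: $H = \nabla F(\Theta)$
$H \to \Theta$: $\Theta = \nabla G(H)$
Log normalizer: $F(\Theta) = -\frac{\theta_1^2}{4\theta_2} + \frac{1}{2}\log\left(-\frac{\pi}{\theta_2}\right)$
Gradient log normalizer: $\nabla F(\Theta) = \left(-\frac{\theta_1}{2\theta_2}, -\frac{1}{2\theta_2} + \frac{\theta_1^2}{4\theta_2^2}\right)$
G: $G(H) = -\frac{1}{2}\log(\eta_2 - \eta_1^2) + C$
Gradient G: $\nabla G(H) = \left(\frac{\eta_1}{\eta_2 - \eta_1^2}, -\frac{1}{2(\eta_2 - \eta_1^2)}\right)$
Sufficient statistics: $t(x) = (x, x^2)$
Carrier measure: $k(x) = 0$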
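
The entries of a flash card translate directly into code. The following minimal sketch implements the univariate Gaussian conversions between source, natural and expectation parameters, and checks that the Kullback-Leibler formula agrees with the Bregman divergence $B_F(\theta_Q \| \theta_P)$ on the log-normalizer. The function names and test values are illustrative.

```python
import math

def lambda_to_theta(mu, var):
    return (mu / var, -1.0 / (2.0 * var))

def theta_to_lambda(t1, t2):
    return (-t1 / (2.0 * t2), -1.0 / (2.0 * t2))

def lambda_to_eta(mu, var):
    return (mu, var + mu * mu)

def eta_to_lambda(e1, e2):
    return (e1, e2 - e1 * e1)

def F(t1, t2):
    """Log-normalizer of the univariate Gaussian in natural parameters."""
    return -t1 * t1 / (4.0 * t2) + 0.5 * math.log(-math.pi / t2)

def grad_F(t1, t2):
    return (-t1 / (2.0 * t2), -1.0 / (2.0 * t2) + t1 * t1 / (4.0 * t2 * t2))

def grad_G(e1, e2):
    var = e2 - e1 * e1
    return (e1 / var, -1.0 / (2.0 * var))

def kl(mu_p, var_p, mu_q, var_q):
    """Closed-form KL divergence between N(mu_p, var_p) and N(mu_q, var_q)."""
    return 0.5 * (math.log(var_q / var_p) + var_p / var_q
                  + (mu_q - mu_p) ** 2 / var_q - 1.0)

def bregman_F(theta_a, theta_b):
    """B_F(theta_a || theta_b) on natural parameters."""
    g = grad_F(*theta_b)
    inner = (theta_a[0] - theta_b[0]) * g[0] + (theta_a[1] - theta_b[1]) * g[1]
    return F(*theta_a) - F(*theta_b) - inner

theta_p = lambda_to_theta(1.0, 4.0)      # P = N(1, 4)
theta_q = lambda_to_theta(-0.5, 2.25)    # Q = N(-0.5, 2.25)
kl_closed = kl(1.0, 4.0, -0.5, 2.25)
kl_from_bregman = bregman_F(theta_q, theta_p)
```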
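3 Univariate Gaussian distribution, $\sigma^2$ fixed

PDF expression: $f(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ for $x \in \mathbb{R}$
Kullback-Leibler divergence: $D_{KL}(f_P \| f_Q) = \frac{(\mu_Q - \mu_P)^2}{2\sigma^2}$
MLE: $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Source parameters: $\Lambda = \mu \in \mathbb{R}$
Natural parameters: $\Theta = \theta \in \mathbb{R}$
Expectation parameters: $H = \eta \in \mathbb{R}$
$\Lambda \to \Theta$: $\Theta = \frac{\mu}{\sigma^2}$
$\Theta \to \Lambda$: $\Lambda = \theta\sigma^2$
$\Lambda \to H$: $H = \mu$
$H \to \Lambda$: $\Lambda = \eta$
$\Theta \to H$: $H = \nabla F(\Theta)$
$H \to \Theta$: $\Theta = \nabla G(H)$
Log normalizer: $F(\Theta) = \frac{\sigma^2\theta^2}{2} + \frac{1}{2}\log(2\pi\sigma^2)$
Gradient log normalizer: $\nabla F(\Theta) = \sigma^2\theta$
G: $G(H) = \frac{\eta^2}{2\sigma^2} + C$
Gradient G: $\nabla G(H) = \frac{\eta}{\sigma^2}$
Sufficient statistics: $t(x) = x$
Carrier measure: $k(x) = -\frac{x^2}{2\sigma^2}$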
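4 Multivariate Gaussian distribution

PDF expression: $f(x;\mu,\Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{(x-\mu)^T\Sigma^{-1}(x-\mu)}{2}\right)$ for $x \in \mathbb{R}^d$
Kullback-Leibler divergence: $D_{KL}(f_P \| f_Q) = \frac{1}{2}\left(\log\frac{\det\Sigma_Q}{\det\Sigma_P} + \mathrm{tr}(\Sigma_Q^{-1}\Sigma_P)\right) + \frac{1}{2}\left((\mu_Q - \mu_P)^T\Sigma_Q^{-1}(\mu_Q - \mu_P) - d\right)$
MLE: $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})^T$
Source parameters: $\Lambda = (\mu, \Sigma)$ with $\mu \in \mathbb{R}^d$ and $\Sigma \succ 0$
Natural parameters: $\Theta = (\theta, \Theta)$
Expectation parameters: $H = (\eta, H)$
$\Lambda \to \Theta$: $\Theta = \left(\Sigma^{-1}\mu, \frac{1}{2}\Sigma^{-1}\right)$
$\Theta \to \Lambda$: $\Lambda = \left(\frac{1}{2}\Theta^{-1}\theta, \frac{1}{2}\Theta^{-1}\right)$
$\Lambda \to H$: $H = \left(\mu, -(\Sigma + \mu\mu^T)\right)$
$H \to \Lambda$: $\Lambda = \left(\eta, -(H + \eta\eta^T)\right)$
$\Theta \to H$: $H = \nabla F(\Theta)$
$H \to \Theta$: $\Theta = \nabla G(H)$
Log normalizer: $F(\Theta) = \frac{1}{4}\mathrm{tr}(\Theta^{-1}\theta\theta^T) - \frac{1}{2}\log\det\Theta + \frac{d}{2}\log\pi$
Gradient log normalizer: $\nabla F(\Theta) = \left(\frac{1}{2}\Theta^{-1}\theta, -\frac{1}{2}\Theta^{-1} - \frac{1}{4}(\Theta^{-1}\theta)(\Theta^{-1}\theta)^T\right)$
G: $G(H) = -\frac{1}{2}\log(1 + \eta^T H^{-1}\eta) - \frac{1}{2}\log\det(-H) - \frac{d}{2}\log(2\pi e)$
Gradient G: $\nabla G(H) = \left(-(H + \eta\eta^T)^{-1}\eta, -\frac{1}{2}(H + \eta\eta^T)^{-1}\right)$
Sufficient statistics: $t(x) = (x, -xx^T)$
Carrier measure: $k(x) = 0$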
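5 Multivariate isotropic Gaussian distribution

We assume an identity variance-covariance matrix $\Sigma = I_d$.

PDF expression: $f(x;\mu) = \frac{1}{(2\pi)^{d/2}} \exp\left(-\frac{(x-\mu)^T(x-\mu)}{2}\right)$ for $x \in \mathbb{R}^d$
Kullback-Leibler divergence: $D_{KL}(f_P \| f_Q) = \frac{1}{2}(\mu_Q - \mu_P)^T(\mu_Q - \mu_P)$
MLE: $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Source parameters: $\Lambda = \mu$ with $\mu \in \mathbb{R}^d$
Natural parameters: $\Theta = \theta$
Expectation parameters: $H = \eta$
$\Lambda \to \Theta$: $\Theta = \mu$
$\Theta \to \Lambda$: $\Lambda = \theta$
$\Lambda \to H$: $H = \mu$
$H \to \Lambda$: $\Lambda = \eta$
$\Theta \to H$: $H = \nabla F(\Theta)$
$H \to \Theta$: $\Theta = \nabla G(H)$
Log normalizer: $F(\Theta) = \frac{1}{2}\theta^T\theta + \frac{d}{2}\log 2\pi$
Gradient log normalizer: $\nabla F(\Theta) = \theta$
G: $G(H) = \frac{1}{2}\eta^T\eta - \frac{d}{2}\log 2\pi$
Gradient G: $\nabla G(H) = \eta$
Sufficient statistics: $t(x) = x$
Carrier measure: $k(x) = -\frac{1}{2}x^T x$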
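6 Poisson distribution

PDF expression: $f(x;\lambda) = \frac{\lambda^x \exp(-\lambda)}{x!}$ for $x \in \mathbb{N}$
Kullback-Leibler divergence: $D_{KL}(f_P \| f_Q) = \lambda_Q - \lambda_P\left(1 + \log\frac{\lambda_Q}{\lambda_P}\right)$
MLE: $\hat{\lambda} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Source parameters: $\Lambda = \lambda \in \mathbb{R}^+$
Natural parameters: $\Theta = \theta \in \mathbb{R}$
Expectation parameters: $H = \eta \in \mathbb{R}^+$
$\Lambda \to \Theta$: $\Theta = \log\lambda$
$\Theta \to \Lambda$: $\Lambda = \exp\theta$
$\Lambda \to H$: $H = \lambda$
$H \to \Lambda$: $\Lambda = \eta$
$\Theta \to H$: $H = \nabla F(\Theta)$
$H \to \Theta$: $\Theta = \nabla G(H)$
Log normalizer: $F(\Theta) = \exp\theta$
Gradient log normalizer: $\nabla F(\Theta) = \exp\theta$
G: $G(H) = \eta\log\eta - \eta + C$
Gradient G: $\nabla G(H) = \log\eta$
Sufficient statistics: $t(x) = x$
Carrier measure: $k(x) = -\log(x!)$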
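7 Centered Laplacian distribution, $\mu = 0$

PDF expression: $f(x;\sigma) = \frac{1}{2\sigma} \exp\left(-\frac{|x|}{\sigma}\right)$ for $x \in \mathbb{R}$
Kullback-Leibler divergence: $D_{KL}(f_P \| f_Q) = \log\frac{\sigma_Q}{\sigma_P} + \frac{\sigma_P - \sigma_Q}{\sigma_Q}$
MLE: $\hat{\sigma} = \frac{1}{n}\sum_{i=1}^{n} |x_i|$
Source parameters: $\Lambda = \sigma \in \mathbb{R}^+$
Natural parameters: $\Theta = \theta \in \mathbb{R}^-$
Expectation parameters: $H = \eta \in \mathbb{R}^+$
$\Lambda \to \Theta$: $\Theta = -\frac{1}{\sigma}$
$\Theta \to \Lambda$: $\Lambda = -\frac{1}{\theta}$
$\Lambda \to H$: $H = \sigma$
$H \to \Lambda$: $\Lambda = \eta$
$\Theta \to H$: $H = \nabla F(\Theta)$
$H \to \Theta$: $\Theta = \nabla G(H)$
Log normalizer: $F(\Theta) = \log\left(-\frac{2}{\theta}\right)$
Gradient log normalizer: $\nabla F(\Theta) = -\frac{1}{\theta}$
G: $G(H) = -\log\eta + C$
Gradient G: $\nabla G(H) = -\frac{1}{\eta}$
Sufficient statistics: $t(x) = |x|$
Carrier measure: $k(x) = 0$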
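8 Bernoulli distribution

PDF expression: $f(x;p) = p^x (1-p)^{1-x}$ for $x \in \{0, 1\}$
Kullback-Leibler divergence: $D_{KL}(f_1 \| f_2) = \log\left(\frac{1-p_1}{1-p_2}\right) - p_1 \log\left(\frac{p_2(1-p_1)}{p_1(1-p_2)}\right)$
MLE: $\hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Source parameters: $\Lambda = p \in [0, 1]$
Natural parameters: $\Theta = \theta \in \mathbb{R}$
Expectation parameters: $H = \eta \in [0, 1]$
$\Lambda \to \Theta$: $\Theta = \log\left(\frac{p}{1-p}\right)$
$\Theta \to \Lambda$: $\Lambda = \frac{\exp\theta}{1 + \exp\theta}$
$\Lambda \to H$: $H = p$
$H \to \Lambda$: $\Lambda = \eta$
$\Theta \to H$: $H = \nabla F(\Theta)$
$H \to \Theta$: $\Theta = \nabla G(H)$
Log normalizer: $F(\Theta) = \log(1 + \exp\theta)$
Gradient log normalizer: $\nabla F(\Theta) = \frac{\exp\theta}{1 + \exp\theta}$
G: $G(H) = \log\left(\frac{\eta}{1-\eta}\right)\eta - \log\left(\frac{1}{1-\eta}\right) + C$
Gradient G: $\nabla G(H) = \log\left(\frac{\eta}{1-\eta}\right)$
Sufficient statistics: $t(x) = x$
Carrier measure: $k(x) = 0$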
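9 Binomial distribution, $n$ fixed $\in \mathbb{N}^+$

PDF expression: $f(x;n,p) = \frac{n!}{x!(n-x)!}\, p^x (1-p)^{n-x}$ where $x \in \mathbb{N}$
Kullback-Leibler divergence: $D_{KL}(f_1 \| f_2) = n(1-p_1)\log\left(\frac{1-p_1}{1-p_2}\right) + n p_1 \log\left(\frac{p_1}{p_2}\right)$
MLE: $\hat{p} = \frac{1}{nm}\sum_{i=1}^{m} x_i$ for $m$ observations
Source parameters: $\Lambda = p \in [0, 1]$
Natural parameters: $\Theta = \theta \in \mathbb{R}$
Expectation parameters: $H = \eta \in \mathbb{R}^+$
$\Lambda \to \Theta$: $\Theta = \log\left(\frac{p}{1-p}\right)$
$\Theta \to \Lambda$: $\Lambda = \frac{\exp\theta}{1 + \exp\theta}$
$\Lambda \to H$: $H = np$
$H \to \Lambda$: $\Lambda = \frac{\eta}{n}$
$\Theta \to H$: $H = \nabla F(\Theta)$
$H \to \Theta$: $\Theta = \nabla G(H)$
Log normalizer: $F(\Theta) = n\log(1 + \exp\theta) - \log(n!)$
Gradient log normalizer: $\nabla F(\Theta) = \frac{n\exp\theta}{1 + \exp\theta}$
G: $G(H) = \eta\log\left(\frac{\eta}{n-\eta}\right) - n\log\left(\frac{n}{n-\eta}\right) + C$
Gradient G: $\nabla G(H) = \log\left(\frac{\eta}{n-\eta}\right)$
Sufficient statistics: $t(x) = x$
Carrier measure: $k(x) = -\log(x!(n-x)!)$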
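10 Multinomial distribution, $n$ fixed

PDF expression: $f(x_1, ..., x_k; p_1, ..., p_k, n) = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}$ for $x_i \in \mathbb{N}$
Kullback-Leibler divergence: $D_{KL}(f_\alpha \| f_\beta) = n\, p_{\alpha,k} \log\frac{p_{\alpha,k}}{p_{\beta,k}} - n \sum_{i=1}^{k-1} p_{\alpha,i} \log\frac{p_{\beta,i}}{p_{\alpha,i}}$
MLE: $\hat{p}_i = \frac{n_i}{n}$, with $n_i$ the observed count of outcome $i$ among the $n$ trials
Source parameters: $\Lambda = (p_1, ..., p_k) \in [0, 1]^k$ with $\sum_i p_i = 1$
Natural parameters: $\Theta = (\theta_1, ..., \theta_{k-1}) \in \mathbb{R}^{k-1}$
Expectation parameters: $H = (\eta_1, ..., \eta_{k-1}) \in [0, n]^{k-1}$
$\Lambda \to \Theta$: $\Theta = \left(\log\frac{p_i}{p_k}\right)_i$
$\Theta \to \Lambda$: $p_i = \frac{\exp\theta_i}{1 + \sum_{j=1}^{k-1}\exp\theta_j}$ if $i < k$, and $p_k = \frac{1}{1 + \sum_{j=1}^{k-1}\exp\theta_j}$
$\Lambda \to H$: $H = (n p_i)_i$
$H \to \Lambda$: $p_i = \frac{\eta_i}{n}$ if $i < k$, and $p_k = \frac{n - \sum_{j=1}^{k-1}\eta_j}{n}$
$\Theta \to H$: $H = \nabla F(\Theta)$
$H \to \Theta$: $\Theta = \nabla G(H)$
Log normalizer: $F(\Theta) = n\log\left(1 + \sum_{i=1}^{k-1}\exp\theta_i\right) - \log n!$
Gradient log normalizer: $\nabla F(\Theta) = \left(\frac{n\exp\theta_i}{1 + \sum_{j=1}^{k-1}\exp\theta_j}\right)_i$
G: $G(H) = \left(\sum_{i=1}^{k-1}\eta_i\log\eta_i\right) + \left(n - \sum_{i=1}^{k-1}\eta_i\right)\log\left(n - \sum_{i=1}^{k-1}\eta_i\right) + C$
Gradient G: $\nabla G(H) = \left(\log\frac{\eta_i}{n - \sum_{j=1}^{k-1}\eta_j}\right)_i$
Sufficient statistics: $t(x) = (x_1, ..., x_{k-1})$
Carrier measure: $k(x) = -\sum_{i=1}^{k}\log x_i!$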
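11 Rayleigh distribution

PDF expression: $f(x;\sigma^2) = \frac{x}{\sigma^2} \exp\left(-\frac{x^2}{2\sigma^2}\right)$
Kullback-Leibler divergence: $D_{KL}(f_P \| f_Q) = \log\left(\frac{\sigma_Q^2}{\sigma_P^2}\right) + \frac{\sigma_P^2 - \sigma_Q^2}{\sigma_Q^2}$
MLE: $\hat{\sigma} = \sqrt{\frac{1}{2n}\sum_{i=1}^{n} x_i^2}$
Source parameters: $\Lambda = \sigma^2 \in \mathbb{R}^+$
Natural parameters: $\Theta = \theta \in \mathbb{R}^-$
Expectation parameters: $H = \eta \in \mathbb{R}^+$
$\Lambda \to \Theta$: $\Theta = -\frac{1}{2\sigma^2}$
$\Theta \to \Lambda$: $\Lambda = -\frac{1}{2\theta}$
$\Lambda \to H$: $H = 2\sigma^2$
$H \to \Lambda$: $\Lambda = \frac{\eta}{2}$
$\Theta \to H$: $H = \nabla F(\Theta)$
$H \to \Theta$: $\Theta = \nabla G(H)$
Log normalizer: $F(\Theta) = -\log(-2\theta)$
Gradient log normalizer: $\nabla F(\Theta) = -\frac{1}{\theta}$
G: $G(H) = -\log\eta$
Gradient G: $\nabla G(H) = -\frac{1}{\eta}$
Sufficient statistics: $t(x) = x^2$
Carrier measure: $k(x) = \log x$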
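12 Gamma distribution

$\Gamma(k) = (k-1)!$ and $\Psi(x) = \frac{\Gamma'(x)}{\Gamma(x)}$

PDF expression: $f(x;\lambda,k) = \frac{x^{k-1}\exp\left(-\frac{x}{\lambda}\right)}{\lambda^k\,\Gamma(k)}$
Kullback-Leibler divergence: $D_{KL}(f_P \| f_Q) = \log\left(\frac{\Gamma(k_Q)\,\lambda_Q^{k_Q}}{\Gamma(k_P)\,\lambda_P^{k_P}}\right) + (k_P - k_Q)\left(\Psi(k_P) + \log\lambda_P\right) + k_P\,\frac{\lambda_P - \lambda_Q}{\lambda_Q}$
MLE: $\log\hat{k} - \Psi(\hat{k}) = \log\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) - \frac{1}{n}\sum_{i=1}^{n}\log x_i$ and $\hat{\lambda} = \frac{1}{n\hat{k}}\sum_{i=1}^{n} x_i$
Source parameters: $\Lambda = (\lambda, k)$
Natural parameters: $\Theta = \left(k - 1, -\frac{1}{\lambda}\right)$
Expectation parameters: $H =$
$\Lambda \to \Theta$: $\Theta =$
$\Theta \to \Lambda$: $\Lambda =$
$\Lambda \to H$: $H =$
$H \to \Lambda$: $\Lambda =$
$\Theta \to H$: $H =$
$H \to \Theta$: $\Theta =$
Log normalizer: $F(\Theta) = \log\Gamma(\theta_1 + 1) + (\theta_1 + 1)\log\left(-\frac{1}{\theta_2}\right)$
Gradient log normalizer: $\nabla F(\Theta) = \left(\Psi(\theta_1 + 1) + \log\left(-\frac{1}{\theta_2}\right), -\frac{\theta_1 + 1}{\theta_2}\right)$
G: $G(H)$ non-closed form
Gradient G: $\nabla G(H)$ non-closed form
Sufficient statistics: $t(x) = (\log x, x)$
Carrier measure: $k(x) = 0$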
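13 Beta distributions

$\Gamma(k) = (k-1)!$, $B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$ and $\Psi(x) = \frac{\Gamma'(x)}{\Gamma(x)}$

PDF expression: $f(x;\alpha,\beta) = \frac{1}{B(\alpha,\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}$
Kullback-Leibler divergence: $D_{KL}(f_P \| f_Q) = \log\frac{B(\alpha_Q,\beta_Q)}{B(\alpha_P,\beta_P)} - (\alpha_Q - \alpha_P)\Psi(\alpha_P) - (\beta_Q - \beta_P)\Psi(\beta_P) + (\alpha_Q - \alpha_P + \beta_Q - \beta_P)\Psi(\alpha_P + \beta_P)$
MLE: $\Psi(\hat{\alpha}) - \Psi(\hat{\alpha} + \hat{\beta}) = \frac{1}{n}\sum_{i=1}^{n}\log x_i$ and $\Psi(\hat{\beta}) - \Psi(\hat{\alpha} + \hat{\beta}) = \frac{1}{n}\sum_{i=1}^{n}\log(1 - x_i)$
Source parameters: $\Lambda = (\alpha, \beta)$
Natural parameters: $\Theta =$
Expectation parameters: $H =$
$\Lambda \to \Theta$: $\Theta =$
$\Theta \to \Lambda$: $\Lambda =$
$\Lambda \to H$: $H =$
$H \to \Lambda$: $\Lambda =$
$\Theta \to H$: $H =$
$H \to \Theta$: $\Theta =$
Log normalizer: $F(\Theta) = \log B(\theta_1 + 1, \theta_2 + 1)$
Gradient log normalizer: $\nabla F(\Theta) = \left(\Psi(\theta_1 + 1) - \Psi(\theta_1 + \theta_2 + 2), \Psi(\theta_2 + 1) - \Psi(\theta_1 + \theta_2 + 2)\right)$
G: $G(H)$ non-closed form
Gradient G: $\nabla G(H)$ non-closed form
Sufficient statistics: $t(x) = (\log x, \log(1 - x))$
Carrier measure: $k(x) = 0$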